“As I scurried across the candlelit chamber, manuscripts in hand, I thought I’d made it. Nothing would be able to hurt me anymore. Little did I know there was one last fright lurking around the corner.” This is the kind of passage from a horror story that both terrifies and excites us. In this report, we analyze a dataset of sentences from horror stories written by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. The data was prepared by chunking larger texts into sentences using CoreNLP’s MaxEnt sentence tokenizer. Specifically, we consider the similarities and differences between the texts attributed to each author and look for patterns that could characterize the writing styles of the three authors.
packages.used=c("widyr","ggraph","igraph","stringr","scales","spacyr","cleanNLP","readr","stringi","ggplot2","corrplot","dplyr","tidyr","forcats","reshape2","ggridges","corrgram","textstem","tidytext","tm","topicmodels","wordcloud","RSentiment","jpeg")
# check packages that need to be installed.
packages.needed=setdiff(packages.used,
intersect(installed.packages()[,1],
packages.used))
# install additional packages
if(length(packages.needed)>0){
install.packages(packages.needed, dependencies = TRUE)
}
library('jpeg')
library('widyr')
library('ggraph')
library('igraph')
library('stringr')
library('scales')
library('spacyr')
library('cleanNLP')
library('readr')
library('stringi')
library('ggplot2')
library('corrplot')
library('dplyr')
library('tidyr')
library('forcats')
library('reshape2')
library('ggridges')
library('corrgram')
library('textstem')
library('tidytext')
library('tm')
library('topicmodels')
library('wordcloud')
library("RSentiment")
# Models
source("../lib/multiplot.R")
First, before we dive into the analysis, let’s load the data and take a glimpse at it.
spookydata = read.csv('../data/spooky.csv', as.is = TRUE)
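Then, we need to pre-process the text data. We keep only rows whose text starts with characters other than “>” and contains a letter or digit, plus rows whose text is empty.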
spookydata <- spookydata %>%
filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == ""
)
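To make the filter above concrete, here is a minimal sketch in base R that applies the same regular expression to a few made-up strings. The example strings are assumptions for illustration only, and `grepl(..., perl = TRUE)` stands in for `str_detect()`.

```r
# Toy strings (hypothetical, not taken from spooky.csv)
examples <- c("It was a dark night.", "> quoted line", "", "...", "a1")

# Same pattern as in the filter: one or more non-">" characters from the
# start of the string, followed by a letter or digit; empty text is kept too.
keep <- grepl("^[^>]+[A-Za-z\\d]", examples, perl = TRUE) | examples == ""

data.frame(text = examples, keep = keep)
```

Rows that begin with “>” or contain no letter or digit are dropped, while empty strings survive because of the second condition.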
Next, let’s check whether sentence length varies much among the authors.
p <- spookydata %>%
mutate(sen_len = str_length(text)) %>%
ggplot(aes(sen_len, author, fill = author)) +
geom_density_ridges() +
scale_x_log10() +
theme(legend.position = "right") +
labs(x = "Sentence length")
#jpeg(file="../figs/sentence.jpeg")
#plot(p)
#dev.off()
plot(p)
## Picking joint bandwidth of 0.0414
The three authors’ sentence-length distributions differ. HP Lovecraft prefers longer sentences, with a concentration around 200 characters.
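A numeric summary can complement the density plot. The following is a minimal sketch in base R on a made-up data frame (the sentences are invented for illustration); on the real data, `toy` would be replaced by `spookydata`.

```r
# Hypothetical toy data with the same columns the analysis uses
toy <- data.frame(
  author = c("EAP", "EAP", "HPL", "HPL", "MWS"),
  text   = c("It was dark.",
             "The raven sat upon the door.",
             "Beyond the hills lay a nameless and ancient city.",
             "I did not sleep.",
             "My heart was full of sorrow."),
  stringsAsFactors = FALSE
)

# Sentence length in characters, as str_length() would compute it
toy$sen_len <- nchar(toy$text)

# Median sentence length per author
med <- aggregate(sen_len ~ author, data = toy, FUN = median)
med
```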
Second, let’s do some simple cleaning of the data: split the text into word tokens and remove stop words. We can also lemmatize the words.
spooky_wrd <- lemmatize_words(spookydata) %>%
unnest_tokens(word, text) %>%
# remove stopwords
anti_join(stop_words, by = "word") %>%
count(author, word) %>%
ungroup()
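`lemmatize_words()` from textstem is documented as taking a vector of words, whereas the pipeline above passes it the whole data frame; the unlemmatized forms that show up in the later results (for example “appeared”) suggest the call may have had little effect. A minimal sketch of a word-level alternative, on made-up sentences, would tokenize first and lemmatize the `word` column afterwards. This is an assumption about the intended behavior, not the original method.

```r
library(dplyr)
library(tidytext)
library(textstem)

# Hypothetical toy sentences, for illustration only
toy <- data.frame(
  author = c("EAP", "HPL"),
  text   = c("The doors appeared suddenly.", "He recalled stranger dreams."),
  stringsAsFactors = FALSE
)

toy_words <- toy %>%
  unnest_tokens(word, text) %>%              # one row per word
  anti_join(stop_words, by = "word") %>%     # remove stop words
  mutate(word = lemmatize_words(word)) %>%   # lemmatize each token
  count(author, word)

toy_words
```

Switching the real pipeline to this form would change the word counts, so the figures and word lists would need to be regenerated.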
In this part, let’s make word clouds to see the most common words used by the three authors, together and separately.
spooky_wrd_all <- spooky_wrd %>%
group_by(word) %>%
summarise(n = sum(n)) %>%
ungroup()
#jpeg(file="../figs/wordcloud_1.jpeg")
#wordcloud(spooky_wrd_all$word, spooky_wrd_all$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
#dev.off()
# brewer.pal(9, ...) returns 9 colours, so the index stops at 9 (index 10 would be NA)
wordcloud(spooky_wrd_all$word, spooky_wrd_all$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
These are the most common words in the dataset. The authors frequently use words such as “life”, “death”, “door” and “light”. It seems that horror fiction likes to set ordinary life against themes of life and death. Naturally, we would expect some differences between the authors. Now, let’s see whether there are any.
spooky_wrd_MWS <- spooky_wrd %>%
filter(author == "MWS") %>%
group_by(word) %>%
summarise(n = sum(n)) %>%
ungroup()
#jpeg(file="../figs/wordcloud_MWS.jpeg")
#wordcloud(spooky_wrd_MWS$word, spooky_wrd_MWS$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
#dev.off()
# brewer.pal(9, ...) returns 9 colours, so the index stops at 9 (index 10 would be NA)
wordcloud(spooky_wrd_MWS$word, spooky_wrd_MWS$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
It seems that Mary Shelley is more focused on the human body and relationships.
spooky_wrd_EAP <- spooky_wrd %>%
filter(author == "EAP") %>%
group_by(word) %>%
summarise(n = sum(n)) %>%
ungroup()
#jpeg(file="../figs/wordcloud_EAP.jpeg")
#wordcloud(spooky_wrd_EAP$word, spooky_wrd_EAP$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
#dev.off()
# brewer.pal(9, ...) returns 9 colours, so the index stops at 9 (index 10 would be NA)
wordcloud(spooky_wrd_EAP$word, spooky_wrd_EAP$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
As for Edgar Allan Poe, he seems more concerned with time. Maybe he is the kind of writer who is good at creating an atmosphere of urgency.
spooky_wrd_HPL <- spooky_wrd %>%
filter(author == "HPL") %>%
group_by(word) %>%
summarise(n = sum(n)) %>%
ungroup()
#jpeg(file="../figs/wordcloud_HPL.jpeg")
#wordcloud(spooky_wrd_HPL$word, spooky_wrd_HPL$n,
#          max.words = 200, scale = c(2.0,0.5),
#          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
#dev.off()
# brewer.pal(9, ...) returns 9 colours, so the index stops at 9 (index 10 would be NA)
wordcloud(spooky_wrd_HPL$word, spooky_wrd_HPL$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
HP Lovecraft differs from the other two in that he favors the night. Perhaps his stories take place mostly at night.
After exploring the words, let’s start the sentiment analysis.
## Emotion
spooky_wrd <- lemmatize_words(spookydata) %>% unnest_tokens(word, text)%>%
anti_join(stop_words, by = "word")
Let’s see the sentiment analysis among the different authors.
pic_wrd_mws <-spooky_wrd %>%
filter(author == "MWS") %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(8, n) %>%
mutate(word = reorder(word, n)) %>%
ungroup() %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
coord_flip()+
labs( x = NULL,y = "Sentiment Analysis") +
ggtitle("Mary Shelley: Negative Positive Words")
#jpeg(file="../figs/senti_MWS.jpeg")
#plot(pic_wrd_mws)
#dev.off()
plot(pic_wrd_mws)
Mary Shelley’s sentiment words fall mostly into the negative, positive and uncertainty categories. Among the negative words, she favors “fear”, “lost” and “poor”. Her top three positive words are “happiness”, “happy” and “pleasure”. Her top three uncertainty words are “appeared”, “suddenly” and “unknown”.
pic_wrd_hpl<-spooky_wrd %>%
filter(author == "HPL") %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(8, n) %>%
mutate(word = reorder(word, n)) %>%
ungroup() %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
coord_flip()+
labs( x = NULL,y = "Sentiment Analysis") +
ggtitle("H P Lovecraft: Negative Positive Words")
#jpeg(file="../figs/senti_HPL.jpeg")
#plot(pic_wrd_hpl)
#dev.off()
plot(pic_wrd_hpl)
HP Lovecraft is very much like Mary Shelley: his sentiment words also fall mainly into the negative, positive and uncertainty categories. His top three negative words are “fear”, “lost” and “recall”. Notably, he shares “fear” and “lost” with Mary Shelley. His top three positive words are “dream”, “leading” and “fantastic”, and his top three uncertainty words are “unknown”, “suddenly” and “appeared”.
pic_wrd_eap<-spooky_wrd %>%
filter(author == "EAP") %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(8, n) %>%
mutate(word = reorder(word, n)) %>%
ungroup() %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
coord_flip()+
labs( x = NULL,y = "Sentiment Analysis") +
ggtitle("E A Poe: Negative Positive Words")
#jpeg(file="../figs/senti_EAP.jpeg")
#plot(pic_wrd_eap)
#dev.off()
plot(pic_wrd_eap)
Lastly, E A Poe’s sentiment words also fall mainly into the negative, positive and uncertainty categories. His top three negative words are “doubt”, “question” and “difficulty”, so he differs from the previous two authors here. His top three positive words are “beautiful”, “easily” and “excited”, and his top three uncertainty words are “doubt”, “appeared” and “suddenly”.
In all, the sentiment words of all three authors concentrate in the positive, negative and uncertainty categories, and Mary Shelley and Lovecraft share some favorite negative words. E A Poe, however, uses quite different words from the other two authors.
Then, we may visualize the sentiment clustering word cloud as below.
#jpeg(file="../figs/sent_wordcloud.jpeg")
spooky_wrd %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 500)
#dev.off()
To further analyze the sentiment of the stories, we can do some numerical analysis. Define the super negative index = (#uncertainty + #negative + #constraining) / (#positive + #negative + #constraining + #uncertainty), where each # is the number of words in a sentence that fall into that sentiment category.
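As a worked example of this definition, here is a minimal sketch in base R. The function name `super_negative_index` and the counts are made up for illustration; returning `NA` for sentences with no sentiment words is an assumption, not something the original specifies.

```r
# Super negative index for given per-sentence category counts
super_negative_index <- function(neg, con, unc, pos) {
  total <- pos + neg + con + unc
  # Undefined when a sentence has no sentiment words at all
  ifelse(total > 0, (neg + con + unc) / total, NA_real_)
}

# A sentence with 3 negative, 0 constraining, 1 uncertainty and 1 positive word
idx <- super_negative_index(neg = 3, con = 0, unc = 1, pos = 1)
idx  # (3 + 0 + 1) / 5 = 0.8

# Equivalent form: one minus the positive share
alt <- 1 - 1 / (1 + 3 + 0 + 1)
```

The second form is the one that matches `frac_neg = 1 - pos/(pos + neg + con + unc)`.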
pic1 <- spooky_wrd %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
ggplot(aes(author, fill = sentiment)) +
geom_bar(position = "fill")
pic2 <- spooky_wrd %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
group_by(author, id, sentiment) %>%
count() %>%
spread(sentiment, n, fill = 0) %>%
group_by(author, id) %>%
summarise(neg = sum(negative),
con = sum(constraining),
unc = sum(uncertainty),
pos = sum(positive)) %>%
arrange(id) %>%
mutate(frac_neg = 1 - pos/(pos + neg + con+unc)) %>%
ggplot(aes(frac_neg, fill = author)) +
geom_density(bw = .3, alpha = 0.5) +
theme(legend.position = "right") +
labs(x = "Super negative index")
layout <- matrix(c(1,2),1,2,byrow=TRUE)
#jpeg(file="../figs/super_negative_index.jpeg")
#multiplot(pic1, pic2, layout=layout)
#dev.off()
multiplot(pic1, pic2, layout=layout)
This figure shows the sentiment distribution for each author. All three authors use negative words most, followed by positive and uncertainty words. The two male authors use a larger share of uncertainty words than the female author, Mary Shelley.
So far we have looked at single-word statistics. It is also interesting to do an n-gram analysis.
As usual, we lemmatize and remove the stop words.
usenet_bigrams <- lemmatize_words(spookydata) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)%>%
separate(bigram, c("word1", "word2"), sep = " ")%>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)%>%
unite(bigram, word1, word2, sep = " ")
usenet_bigram_counts <- usenet_bigrams %>%
count(author, bigram, sort = TRUE) %>%
ungroup() %>%
separate(bigram, c("word1", "word2"), sep = " ")
usenet_bigram_counts
bigram_tf_idf <- usenet_bigrams %>%
count(author, bigram) %>%
bind_tf_idf(bigram, author, n) %>%
arrange(desc(tf_idf))
#jpeg(file="../figs/tf_idf_authors.jpeg")
bigram_tf_idf %>%
group_by(author) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(bigram = reorder(bigram, tf_idf)) %>%
ggplot(aes(bigram, tf_idf, fill = author)) +
geom_col(show.legend = FALSE) +
facet_wrap(~author, scales = "free") +
ylab("tf-idf") +
coord_flip()
#dev.off()
As we can see, bigrams such as “ha ha” and “heh heh” rank among the most distinctive for Lovecraft and Poe, while Mary Shelley makes little use of them. This may be a major difference between the male and female authors in this dataset.
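For reference, the tf-idf score behind this ranking can be computed by hand. The sketch below uses base R on made-up counts (the numbers and the bigrams “old man” and “lord raymond” are hypothetical), with term frequency as the share of a bigram within an author and idf as the natural log of the number of authors divided by the number of authors using the bigram, which is how `bind_tf_idf()` defines it.

```r
# Hypothetical bigram counts, one row per (author, bigram)
counts <- data.frame(
  author = c("EAP", "EAP", "HPL", "HPL", "MWS", "MWS"),
  bigram = c("ha ha", "old man", "heh heh", "old man", "lord raymond", "old man"),
  n      = c(3, 5, 4, 6, 8, 2),
  stringsAsFactors = FALSE
)

# Term frequency: count divided by the author's total count
counts$tf <- counts$n / ave(counts$n, counts$author, FUN = sum)

# Inverse document frequency: ln(number of authors / authors using the bigram)
n_docs <- length(unique(counts$author))
docs_with_term <- table(counts$bigram)
counts$idf <- log(n_docs / as.numeric(docs_with_term[counts$bigram]))

counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

A bigram used by every author gets an idf of zero, so only author-specific bigrams can rank highly.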
Next, let’s explore the structure using a Latent Dirichlet Allocation (LDA) model. This method yields an unsupervised classification of documents. By seeking clusters that correspond to different topics, we will be able to find underlying structure in the data.
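As a minimal sketch of what an LDA fit returns, the example below fits a two-topic model to a tiny made-up corpus and extracts the two matrices used later: per-topic word probabilities (beta) and per-document topic proportions (gamma). The documents, words and counts are hypothetical, and `posterior()` from topicmodels is used here instead of `tidy()` to keep the sketch short.

```r
library(tidytext)
library(topicmodels)

# Hypothetical word counts per document
toy_counts <- data.frame(
  doc  = c("d1", "d1", "d1", "d2", "d2", "d2", "d3", "d3", "d3"),
  word = c("night", "dark", "door", "love", "heart", "life",
           "night", "heart", "time"),
  n    = c(4, 3, 2, 5, 3, 2, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Document-term matrix, then a 2-topic LDA with Gibbs sampling
toy_dtm <- cast_dtm(toy_counts, doc, word, n)
toy_lda <- LDA(toy_dtm, k = 2, method = "Gibbs", control = list(seed = 1234))

post <- posterior(toy_lda)
post$terms   # beta: one row per topic, probabilities over words
post$topics  # gamma: one row per document, proportions over topics
```

Each row of both matrices sums to one, which is why gamma can be read as the share of a document attributed to each topic.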
# treat each author's text as one document (splitting into chapters is disabled below)
by_chapter <- spookydata %>%
group_by(author) %>%
#mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() #%>%
#filter(chapter > 0) %>%
#unite(document, author, text)
# split into words
by_chapter_word <- by_chapter %>%
unnest_tokens(word, text)
# find document-word counts
word_counts <- by_chapter_word %>%
anti_join(stop_words) %>%
count(author, word, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
word_counts
chapters_dtm <- word_counts %>%
cast_dtm(author, word, n)
chapters_dtm
## <<DocumentTermMatrix (documents: 3, terms: 24949)>>
## Non-/sparse entries: 40182/34665
## Sparsity : 46%
## Maximal term length: 19
## Weighting : term frequency (tf)
We fit an LDA model with 10 candidate topics, using Gibbs sampling.
k <- 10
chapters_lda <- LDA(chapters_dtm, k = k, method = "Gibbs", control = list(seed = 1234))
chapters_lda
## A LDA_Gibbs topic model with 10 topics.
chapter_topics <- tidy(chapters_lda, matrix = "beta")
chapter_topics
top_terms <- chapter_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
pic_topic<-top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
#jpeg(file="../figs/topics.jpeg")
#plot(pic_topic)
#dev.off()
plot(pic_topic)
We can now see the ten topics and the five most probable words in each topic.
chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma
chapters_gamma <- chapters_gamma %>%
separate(document, c("author"), sep = "_", convert = TRUE)
chapters_gamma
topic_a<-chapters_gamma %>%
mutate(author = reorder(author, gamma * topic)) %>%
ggplot(aes(factor(topic), gamma)) +
geom_boxplot() +
facet_wrap(~ author)
#jpeg(file="../figs/topics_author.jpeg")
#plot(topic_a)
#dev.off()
plot(topic_a)
Topic 1, which describes atmosphere and environment, is clearly more prominent for Lovecraft than for the other two authors, while Poe concentrates on topic 2 and Mary Shelley on topic 9. The differences revealed in this plot illustrate how the authors differ.
chapter_classifications <- chapters_gamma %>%
group_by(author) %>%
top_n(1, gamma) %>%
ungroup()
chapter_classifications
book_topics <- chapter_classifications %>%
count(author, topic) %>%
group_by(author) %>%
top_n(1, n) %>%
ungroup() %>%
transmute(consensus = author, topic)
assignments <- augment(chapters_lda, data = chapters_dtm)
assignments <- assignments %>%
separate(document, c("author"), sep = "_", convert = TRUE) %>%
inner_join(book_topics, by = c(".topic" = "topic"))
heat_map<-assignments %>%
count(author, consensus, wt = count) %>%
group_by(author) %>%
mutate(percent = n / sum(n)) %>%
ggplot(aes(consensus, author, fill = percent)) +
geom_tile() +
scale_fill_gradient2(high = "red", label = percent_format()) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank()) +
labs(x = "Topics assigned to",
y = "Topics came from",
fill = "% of topics")
#jpeg(file="../figs/topics_heatmap.jpeg")
#plot(heat_map)
#dev.off()
plot(heat_map)
It is not hard to see that, measured by topics, these authors have little in common. So it would be sensible to classify the stories according to their topics.
In all, the three authors differ in many respects, including gender, sentence length, preferred topics and word choice. However, they are much alike in the distribution of sentiment across their words. This may be viewed as a pattern of horror fiction.